1 Introduction

In the present work I will use Gapminder Foundation’s data to address questions in the context of skill assessments of the Bioinformatics Research Network. The Gapminder data is a collection of statistical data on various indicators of global development, such as population, income, health, and education.

The available data for the assessment is a table mostly aggregated at the Year-Country level (see Preprocessing Steps section), i.e., indicators are computed with this level of granularity. The observations encompass a period from 1962 through 2007: there is one observation every five years, for each country. Among the included indicators, we have e. g.:

  1. CO2 emissions (metric tons per capita)
  2. Imports of goods and services (% of GDP)
  3. Population
  4. Energy use (kg of oil equivalent per capita)
  5. GDP per capita
  6. Population density (people per sq. km of land area)
  7. Life expectancy at birth, total (years)
  8. etc.

2 Preprocessing Steps

After working through all the questions, I realized that many rows of the original table didn’t have an associated continent. By exploring the table, I realized that under the column “Country Name” there were some values like “Middle East & North Africa”, “OECD members”, “Lower middle income”, etc. These rows presumably aggregate data of different countries that do not have just one continent associated with them. Therefore, it’s reasonable not to have continent data for these rows.

However, some rows actually correspond to countries (e.g., “Papua New Guinea”, “Monaco”, “Korea, Dem. People�s Rep.”, etc.), and these do not have an associated continent. These countries need to have their continent datum imputed since the continent variable is crucial to some of the questions in the assessment.

2.1 External source for a continent lookup table

I found a table of countries of the world and their respective continents at https://statisticstimes.com/geography/countries-by-continents.php. After donwloading it, I:

  1. Extracted a list of unique values in “Country Name” column from Gapminder table,
  2. Did the same for the table of the above mentioned website,
  3. Used ChatGPT to build a csv table with the best matching pairs of the above two lists,
  4. The job done by GPT was excellent, although there remained a few countries without a matching partner. These were very few, and I managed to correct it manually.
  5. In this way, I managed to produce a final lookup table that allows me to impute continent for all rows in Gapminder’s table, except for Virgin Islands (this case is more difficult to map, since it may refer to British V.I., which would map to Europe; or it could refer to United States V.I., which would map to Americas; I decided to leave continent as NA for this country).

3 Questions

3.1 Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing 'CO2 emissions (metric tons per capita)' and gdpPercap for the filtered data.

In order to better visualize the country-dots, here is the plot on a log-log scale:

If we didn’t display the dots on a log-log scale, we would see a tiny area of the plot highly crowded, near the axes origin, as below:

3.2 On the filtered data, calculate the correlation of 'CO2 emissions (metric tons per capita)' and gdpPercap. What is the correlation and associated p value?

By looking at the scatterplot above, we realize that computing the correlation between these two variables would yield and artificially high value, due to the presence of the data point corresponding to Kuwait ('CO2 emissions (metric tons per capita' = 42.6 and GDP per capita = 95,458.1). On the whole dataset, including Kuwait, the correlation is 0.926 and the p-value equals 1.128679e-46.

However, if we take out Kuwait, the correlation is 0.806 and the p-value: 1.082225e-25.

3.3 On the unfiltered data, answer “In what year is the correlation between 'CO2 emissions (metric tons per capita)' and gdpPercap the strongest?” Filter the dataset to that year for the next step…

Below I plot the correlation between these two variables over the years.

We can observe that the maximum correlation is attained on year 1967.

3.4 Using plotly, create an interactive scatter plot comparing 'CO2 emissions (metric tons per capita)' and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.

Below is the plot, on a log-log scale, as in question 3.1:

3.5 What is the relationship between continent and 'Energy use (kg of oil equivalent per capita)'? (stats test needed)

To answer this question, I think it’s good to have an overview of the Energy Use (EU, for short) variable over the years, grouped by Continent. Below is a collection of boxplots, one per year:

After exploring the box-plots for each year and continent, I noticed that there was something strange regarding the Americas on years 1962 and 1967, and besides Africa does not appear in those years. Thus, I decided to count the rows (countries) with non-missing EU data for each combination of Year and Continent. These counts are shown in the following table:

Counts of rows (countries) for each Year-Continent combination. (only rows with non-missing EU data included)
Africa Americas Asia Europe Oceania
1962 0 2 1 20 2
1967 0 2 1 21 2
1972 25 25 32 28 2
1977 25 25 32 28 2
1982 26 25 32 28 2
1987 26 25 33 28 2
1992 28 25 40 40 2
1997 28 25 42 40 2
2002 29 25 42 41 2
2007 39 36 45 42 9

A few observations:

  1. For years 1962 and 1967, there is no EU data for Africa (as noticed before), and only one and two countries data for Asia and Americas, respectively.
  2. From 1972 onwards, the number of countries with available EU data per continent is more stable, increasing less dramatically over time.
  3. Oceania has data only for two countries from 1962 through 2002. These countries are Australia and New Zealand.
  4. The number of countries with available EU data displays a sharp increase for year 2007 w.r.t. the previous measurement (2002), especially for Africa, Americas and Oceania.

If we are going to compare energy consumption between continents, it doesn’t seem appropriate to include data from years 1962 and 1967, since this would certainly bias Americas data upwards (the only two countries reported for these two years are USA and Canada), apart from the fact that we have no data for Africa and only one country representing Asia. Thus, I decided to discard years 1962 and 1967 for the next steps.

On the other side, if we want to understand the relationship between continent and EU per capita, we need to reconcile data from different countries within each continent. Otherwise, we would weigh the same way the per-capita datum for countries with very different populations. Thus, we need to create a variable that computes the EU per capita aggregated at the continent-year level.

This can be achieved by using the other available variables in the dataset, by a sequence of steps:

  1. Computing the EU (total, not per capita) at the country-year level. For this, we will need the Population datum (variable pop) to be available. Therefore, we will need to filter out rows that don’t meet this requisite.
  2. Computing the EU (total, not per capita) at the continent-year level.
  3. Computing the population at the continent-year level.
  4. Diving EU (continent-year) by population (continent-year), to finally obtain EU per capita at the continent-year level.

By filtering out rows with non-available population datum, the country count per year and continent table looks like follows:

Counts of rows (countries) for each Year-Continent combination. (only rows with EU and population data available.)
Africa Americas Asia Europe Oceania
1972 22 23 22 25 2
1977 22 23 22 25 2
1982 23 23 22 25 2
1987 23 23 23 25 2
1992 25 23 22 29 2
1997 25 23 24 29 2
2002 26 23 24 29 2
2007 33 23 24 30 2

Now that we have excluded rows with non-available population data, we can observe that the count of countries doesn’t increase so much in 2007 w.r.t. 2002. It seems safe to keep year 2007 for the following analysis.

Finally, we can summarize these metrics in the following boxplot, displaying eight data points for each continent (one per year):

From this box-plot one can see there is a clear difference between two groups of continents. On one side, there are Africa and Asia, with a lower EU per capita, and on the other Americas, Europe and Oceania, with higher levels. The difference is striking to the eye, so it doesn’t seem strictly necessary to conduct a statistical test. However, a statistical test could answer whether there are significant differences between pairs of groups. In our case, ANOVA is not a suitable choice since the boxplot suggests that variances are not the same across groups; besides, an ANOVA would simply reveal that EU per capita is not the same across continents, which is already self-evident. As an alternative, we can use pairwise t-tests and correct for multiple comparisons.

The following table shows the p-values of such pairwise t-tests:

p-values of pairwise t-test among pairs of countries.
Africa Americas Asia Europe
Americas 0.00000 NA NA NA
Asia 0.84222 0.00000 NA NA
Europe 0.00000 0.00000 0 NA
Oceania 0.00000 0.00207 0 7e-05

By conducting pairwise t-tests, and adjusting p-values by the Benjamini-Hochberg procedure, we can observe that there is a significant difference among any pair of chosen continents, except for Africa and Asia ( adjusted p-value ~ 0.84): we cannot conclude that the mean EU per capita in Africa is different from that in Asia. To sum up, we can arrange continents in four groups according to their levels of EU per capita, and sort them in increasing order:

  1. Africa and Asia
  2. Europe
  3. Americas
  4. Oceania

3.5.1 Final remark on this question.

This dataset does not include Papua New Guinea (Oceania). This country has way less energy consumption per capita than Autralia and New Zealand (~3300 kWh vs ~66,600 kWh and ~55,100 kWh, respectively, as of 2019, according to Our World in Data: https://ourworldindata.org/grapher/per-capita-energy-use?tab=chart&country=PNG~AUS~NZL - visited on Dec 3rd, 2022). Considering that Papua New Guinea accounts for approximately 21% of Oceania’s population (https://en.wikipedia.org/wiki/List_of_Oceanian_countries_by_population, visited on Dec 3rd 2022), the inclusion of this country in the dataset would certainly lower the estimate of energy consumption per capita for Oceania. However, enriching the gapminder dataset with external sources seems to be outside the scope of the present skill assessment.

3.6 Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990? (stats test needed)

Again, similar considerations regarding the aggregation level of reported metrics are necessary here as in the previous question. Imports (% of GDP) is reported at the country-year level, and not at the continent-year level. If we want to compare this metric for two continents, we must first compute this metric at the continent-year level.

The following table displays the Imports of goods and services as a percentage of GDP, aggregated at the continent-year level:

Imports of goods and services (% GDP) for Asia and Europe since 1990.
Year Asia Europe
1992 17.7 25.4
1997 20.6 29.0
2002 22.7 31.8
2007 29.4 37.7

By looking at this table, one can observe that imports as a percentage of GDP increased for both continents in the period from 1992 through 2007, and Europe’s imports are consistently higher for each of the sampled years. Using a box-plot to visualize these data hides this consistent difference over time, since the range encompassed by both continents’ imports partially overlaps:

Thus, it seems more appropriate to compare the groups by pairing the Import values according to Year. These pairs can be compared by using a paired t-test. This test yields a significant difference between the two continents (p-value < 1.044115e-04)

3.7 What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

Monaco and Macao appear 5 times each in the 10 years sampled as the most densely populated countries in the world:

Country Name Times ranked first by Population Density
Macao SAR, China 5
Monaco 5

The following table shows in which year each of the two countries ranked as the most densely populated country in the world (indicated by a “1”)

Macao SAR, China Monaco
1962 0 1
1967 0 1
1972 1 0
1977 0 1
1982 0 1
1987 1 0
1992 1 0
1997 1 0
2002 1 0
2007 0 1

3.8 What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' since 1962?

The following plot shows one line for each country: the evolution of Life Expectancy (LE, for short) over the years.

One can observe that most of the lines seem to show an increasing trend (maybe not perfectly monotonous, but certainly increasing in the long run). However, there are some countries that suffered significant setbacks, possibly because of dictatorships, civil wars, etc.

If we simply took the percentage increment between the minimum and maximum LE for each country, it’s likely that the most highly ranked countries would be examples of such tragedies. Thus, we would pick up countries based on specific historical events rather than countries that managed to increase their LE due to the implementation of public health care policies, sustained economic development, etc. Latter causes seem to be more relevant for the present assessment, while the former (dictatorships, civil wars, etc.) might be more relevant to History scholars. Therefore, I will compute other metrics for each country rather than only the percentage increment between the minimum and maximum of LE. My goal is to build indicators that will reveal which cases are examples of countries that suffered extraordinary events in the course of the years between 1962 and 2007.

The variables I intend to build are:

  1. min_year: the first year for which there is LE data for the country in question.
  2. max_year: ..you guessed it! :)
  3. year_at_min_LE: the year in the 1962 — 2007 time span when the minimum LE was attained.
  4. year_at_max_LE: analogous to year_at_min_LE, but for the maximum (enough jokes!).
  5. min_LE: minimum LE attained over all years.
  6. max_LE: maximum LE attained over all years.
  7. n_measurements: Count of years where we have available LE data for the country in question.
  8. n_years_LE_decreased: Count of years when the country in question witnessed a decrease in LE w.r.t. the previous measurement.
  9. LE_prcnt_increase: Percentage increase in LE (100*(max_LE - min_LE)/min_LE) over the whole period examined.

The following table shows the top-10 countries ordered by decreasing LE_prcnt_increase:

Country Name year_at_min_LE year_at_max_LE min_LE max_LE n_years_LE_decreased LE_prcnt_increase min_year max_year n_measurements
Cambodia 1977 2007 19.3 64.6 2 235.1 1962 2007 10
Rwanda 1992 2007 27.5 57.9 2 110.8 1962 2007 10
Timor-Leste 1977 2007 32.9 65.8 1 100.4 1962 2007 10
Bhutan 1962 2007 33.1 66.3 0 100.3 1962 2007 10
Maldives 1962 2007 38.5 75.4 0 95.9 1962 2007 10
Mali 1962 2007 28.5 54.3 0 90.1 1962 2007 10
Nepal 1962 2007 36.0 66.6 0 85.1 1962 2007 10
Gambia, The 1962 2007 32.7 58.6 0 79.3 1962 2007 10
Yemen, Rep.  1962 2007 34.7 62.0 0 78.3 1962 2007 10
Afghanistan 1962 2007 33.2 57.8 0 74.1 1962 2007 10

We can observe that the first two countries are Cambodia and Rwanda. Both have the minimum LE much later than 1962, despite both countries having LE data for 1962 (as showed by the min_year variable). Besides, these two countries saw their LE’s decrease with respect to the previous measurement on two occasions (see column n_years_LE_decreased). The years where the minimum LE is attained are 1977 and 1992 for Cambodia and Rwanda, respectively.

In the case of Cambodia, the minimum of LE around 1977 is explained by the bloody dictatorship of the Khmer Rouge and their leader Pol Pot ( https://en.wikipedia.org/wiki/Pol_Pot#Leader_of_Kampuchea, visited on Dec 4th 2022). On the other hand, Rwanda underwent a civil war between the years 1990 and 1994 (https://en.wikipedia.org/wiki/Rwandan_Civil_War visited on Dec 4th, 2022), so this seems to explain the low LE measurement.

The third country in the list, Timor-Leste, traversed years of turmoil during the 1970s: it gained independence from Portugal, underwent a short civil war, and was invaded by Indonesia in 1975. For twenty-four years, the Indonesian government subjected the people of East Timor to routine and systematic torture, sexual slavery, extrajudicial executions, massacres, and deliberate starvation. (https://en.wikipedia.org/wiki/Indonesian_occupation_of_East_Timor, visited on Dec 4th, 2022). Indeed, this country has the minimum LE on year 1977. The case of Timor-Leste also seems not to be of special interest for the present assessment.

The fourth country in the list, Bhutan, does not appear to have underwent major widespread tragedies after 1962, except for the 1990 refugee crisis (https://en.wikipedia.org/wiki/Bhutan#1990s_refugee_crisis , visited on Dec 4th 2022). LE for Bhutan displays a steady growth during this period. The overall growth in LE in the period 1962-2007 is of 100.3%, from 33.1 to 66.3 years.